Results 1 - 20 of 703
1.
Comput Math Methods Med ; 2022: 7191684, 2022.
Article in English | MEDLINE | ID: mdl-35242211

ABSTRACT

Protein-protein interactions (PPIs) play a crucial role in understanding disease pathogenesis and genetic mechanisms, in guiding drug design, and in other biochemical processes, so identifying PPIs is of great importance. With the rapid development of high-throughput sequencing technology, a large amount of PPI sequence data has accumulated. Researchers have designed many experimental methods to detect PPIs from these sequence data, and PPI prediction has become a research hotspot in proteomics. However, because traditional experimental methods are both time-consuming and costly, it is difficult to analyze and predict the massive amount of PPI data quickly and accurately. To address these issues, many computational systems based on machine learning have been applied to PPI prediction, improving the overall recognition rate. In this paper, a novel and efficient computational method is presented that implements a protein interaction prediction system using only protein sequence information. First, the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) was employed to generate a position-specific scoring matrix (PSSM) containing protein evolutionary information from the initial protein sequence. Second, we used a novel feature representation scheme, MatFLDA, to extract the essential information of the PSSM for protein sequences, and obtained five training and five testing datasets by five-fold cross-validation. Finally, the random fern (RFs) classifier was employed to infer the interactions among proteins, and a model called MatFLDA_RFs was developed. The proposed MatFLDA_RFs model achieved 95.03% average accuracy on the Yeast dataset and 85.35% average accuracy on the H. pylori dataset, outperforming other existing computational methods. The experimental results indicate that the proposed method yields better PPI prediction results and provides an effective tool for detecting new PPIs and for the in-depth study of proteomics. We also developed a web server for the proposed model to predict protein-protein interactions, which is freely accessible online at http://120.77.11.78:5001/webserver/MatFLDA_RFs.
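
The five-fold cross-validation step mentioned in this abstract can be sketched as follows. This is a minimal illustration in Python, not the authors' pipeline: the function name `k_fold_splits`, the shuffling seed, and the use of plain sample indices in place of MatFLDA feature vectors are all assumptions made for the example.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Partition sample indices into k (train, test) index pairs.

    Hypothetical helper: indices stand in for protein-pair feature vectors.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    for train, test in k_fold_splits(10):
        print(sorted(train), sorted(test))
```

Each sample appears in exactly one test set, so the five accuracies averaged at the end are computed on disjoint held-out data.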


Subjects
Protein Interaction Mapping/methods , Protein Interaction Maps/genetics , Amino Acid Sequence , Bacterial Proteins/genetics , Computational Biology , Databases, Protein/statistics & numerical data , Discriminant Analysis , Evolution, Molecular , Helicobacter pylori/genetics , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Machine Learning , Position-Specific Scoring Matrices , Protein Interaction Mapping/statistics & numerical data , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae Proteins/genetics , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Support Vector Machine
2.
Comput Math Methods Med ; 2022: 8691646, 2022.
Article in English | MEDLINE | ID: mdl-35126641

ABSTRACT

Task scheduling in parallel multiple sequence alignment (MSA) through improved dynamic programming optimization speeds up alignment processing. The growing importance of matching multiple sequences also calls for the use of parallel processor systems. This work proposes a dynamic programming algorithm for improved task scheduling in parallel MSA. Specifically, aligning several proteins with tertiary structure is computationally more complex than simple word-based MSA, and parallel task processing is computationally more efficient for protein structure-based superposition. The basic condition for applying dynamic programming is also fulfilled, because the task scheduling problem has multiple possible solutions or options. The search space is reduced through a greedy strategy to speed up processing. Result quality is ensured through computationally expensive recursive and iterative greedy approaches. Optimal scheduling schemes show better performance on heterogeneous resources using CPU or GPU.


Subjects
Algorithms , Computational Biology/methods , Sequence Alignment/methods , Computational Biology/statistics & numerical data , Humans , Sequence Alignment/statistics & numerical data , Software
3.
J Comput Biol ; 29(2): 155-168, 2022 02.
Article in English | MEDLINE | ID: mdl-35108101

ABSTRACT

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
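
The simplest quantity in this model, the expected number of mutated k-mers, follows directly from the independence assumption: a k-mer is mutated when at least one of its k nucleotides is mutated, which happens with probability 1 - (1 - r)^k. The Python sketch below illustrates only that expectation, a naive point estimate obtained by inverting it, and a small simulation. It does not reproduce the variance formulas, hypothesis tests, or confidence intervals derived in the paper, and all function names are hypothetical.

```python
import random


def mutated_kmer_probability(r, k):
    """Probability that a k-mer contains at least one mutated nucleotide."""
    return 1.0 - (1.0 - r) ** k


def expected_mutated_kmers(num_kmers, r, k):
    """Expected number of mutated k-mers, by linearity of expectation."""
    return num_kmers * mutated_kmer_probability(r, k)


def estimate_r(num_mutated, num_kmers, k):
    """Point estimate of r obtained by inverting the expectation."""
    return 1.0 - (1.0 - num_mutated / num_kmers) ** (1.0 / k)


def simulate_mutated_kmers(num_kmers, r, k, rng):
    """Mutate each nucleotide independently and count mutated k-mers."""
    n = num_kmers + k - 1  # sequence length in nucleotides
    mutated = [rng.random() < r for _ in range(n)]
    window = sum(mutated[:k])  # mutations inside the first k-mer
    count = 1 if window > 0 else 0
    for i in range(1, num_kmers):
        # Slide the window one position to the right.
        window += mutated[i + k - 1] - mutated[i - 1]
        if window > 0:
            count += 1
    return count


if __name__ == "__main__":
    rng = random.Random(0)
    observed = simulate_mutated_kmers(2000, 0.01, 21, rng)
    print(observed, expected_mutated_kmers(2000, 0.01, 21))
    print(estimate_r(observed, 2000, 21))
```

Because neighbouring k-mers share nucleotides, the counts are strongly correlated, which is why the variance needs the more careful treatment the abstract describes.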


Subjects
Mutation , Sequence Analysis, DNA/statistics & numerical data , Algorithms , Base Sequence , Computational Biology , Confidence Intervals , Genomics/statistics & numerical data , Humans , Models, Genetic , Sequence Alignment/statistics & numerical data , Software
4.
J Comput Biol ; 29(1): 19-22, 2022 01.
Article in English | MEDLINE | ID: mdl-34985990

ABSTRACT

Although the availability of various sequencing technologies allows us to capture different genome properties at single-cell resolution, with the exception of a few co-assaying technologies, applying different sequencing assays on the same single cell is impossible. Single-cell alignment using optimal transport (SCOT) is an unsupervised algorithm that addresses this limitation by using optimal transport to align single-cell multiomics data. First, it preserves the local geometry by constructing a k-nearest neighbor (k-NN) graph for each data set (or domain) to capture the intra-domain distances. SCOT then finds a probabilistic coupling matrix that minimizes the discrepancy between the intra-domain distance matrices. Finally, it uses the coupling matrix to project one single-cell data set onto another through barycentric projection, thus aligning them. SCOT requires tuning only two hyperparameters and is robust to the choice of one. Furthermore, the Gromov-Wasserstein distance in the algorithm can guide SCOT's hyperparameter tuning in a fully unsupervised setting when no orthogonal alignment information is available. Thus, SCOT is a fast and accurate alignment method that provides a heuristic for hyperparameter selection in a real-world unsupervised single-cell data alignment scenario. We provide a tutorial for SCOT and make its source code publicly available on GitHub.
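
The final step described here, barycentric projection, can be illustrated independently of how the coupling matrix is obtained. In the sketch below the coupling is assumed to be given (SCOT computes it through Gromov-Wasserstein optimal transport, which is not shown), and each sample of the first data set is mapped to the coupling-weighted average of the samples in the second. The function name and the toy numbers are assumptions for illustration, not part of the SCOT code base.

```python
def barycentric_projection(coupling, targets):
    """Project source samples into the target space.

    coupling[i][j] is the transport mass between source sample i and
    target sample j; targets[j] is the feature vector of target sample j.
    Source sample i is mapped to the mass-weighted mean of the targets.
    """
    dim = len(targets[0])
    projected = []
    for row in coupling:
        mass = sum(row)
        if mass == 0:
            raise ValueError("source sample has no transport mass")
        projected.append(
            [sum(w * y[d] for w, y in zip(row, targets)) / mass for d in range(dim)]
        )
    return projected


if __name__ == "__main__":
    coupling = [[0.5, 0.0], [0.25, 0.25]]  # toy coupling, rows = source samples
    targets = [[0.0, 2.0], [4.0, 6.0]]     # toy target-domain coordinates
    print(barycentric_projection(coupling, targets))
```

After projection both data sets live in the same feature space, so standard distance-based comparisons can be applied across domains.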


Subjects
Algorithms , Sequence Alignment/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Computational Biology , Databases, Genetic/statistics & numerical data , Genomics/statistics & numerical data , Heuristics , Humans , Neural Networks, Computer , Sequence Analysis/statistics & numerical data , Software , Unsupervised Machine Learning
5.
J Comput Biol ; 29(2): 92-105, 2022 02.
Article in English | MEDLINE | ID: mdl-35073170

ABSTRACT

Template-based modeling (TBM), including homology modeling and protein threading, is one of the most reliable techniques for protein structure prediction. It predicts protein structure by building an alignment between the query sequence under prediction and the templates with solved structures. However, it is still very challenging to build the optimal sequence-template alignment, especially when only distantly related templates are available. Here we report a novel deep learning approach ProALIGN that can predict much more accurate sequence-template alignment. Like protein sequences consisting of sequence motifs, protein alignments are also composed of frequently occurring alignment motifs with characteristic patterns. Alignment motifs are context-specific as their characteristic patterns are tightly related to sequence contexts of the aligned regions. Inspired by this observation, we represent a protein alignment as a binary matrix (in which 1 denotes an aligned residue pair) and then use a deep convolutional neural network to predict the optimal alignment from the query protein and its template. The trained neural network implicitly but effectively encodes an alignment scoring function, which reduces inaccuracies in the handcrafted scoring functions widely used by the current threading approaches. For a query protein and a template, we apply the neural network to directly infer likelihoods of all possible residue pairs in their entirety, which could effectively consider the correlations among multiple residues. We further construct the alignment with maximum likelihood, and finally build a structure model according to the alignment. Tested on three independent data sets with a total of 6688 protein alignment targets and 80 CASP13 TBM targets, our method achieved much better alignments and 3D structure models than the existing methods, including HHpred, CNFpred, CEthreader, and DeepThreader. 
These results clearly demonstrate the effectiveness of exploiting the context-specific alignment motifs by deep learning for protein threading.
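
The binary-matrix representation of an alignment mentioned in this abstract is easy to make concrete. The sketch below converts a gapped pairwise alignment into a matrix whose entry (i, j) is 1 when query residue i is aligned to template residue j. It only illustrates the representation: the function name and the gap character are assumptions, and the network that predicts such matrices is not shown.

```python
def alignment_to_matrix(aligned_query, aligned_template, gap="-"):
    """Encode a gapped pairwise alignment as a binary residue-pair matrix."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    q_len = sum(c != gap for c in aligned_query)
    t_len = sum(c != gap for c in aligned_template)
    matrix = [[0] * t_len for _ in range(q_len)]
    i = j = 0  # current residue index in query and template
    for q, t in zip(aligned_query, aligned_template):
        if q != gap and t != gap:
            matrix[i][j] = 1  # residue i of the query pairs with residue j
        if q != gap:
            i += 1
        if t != gap:
            j += 1
    return matrix


if __name__ == "__main__":
    for row in alignment_to_matrix("AC-G", "A-TG"):
        print(row)
```

In a valid alignment every row and every column contains at most one 1, and the 1 entries move monotonically down and to the right.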


Subjects
Deep Learning , Proteins/chemistry , Sequence Alignment/statistics & numerical data , Algorithms , Amino Acid Motifs , Amino Acid Sequence , Computational Biology , Models, Molecular , Neural Networks, Computer , Protein Conformation , Proteins/genetics , Sequence Analysis, Protein/statistics & numerical data , Software
6.
J Comput Biol ; 29(1): 3-18, 2022 01.
Article in English | MEDLINE | ID: mdl-35050714

ABSTRACT

Recent advances in sequencing technologies have allowed us to capture various aspects of the genome at single-cell resolution. However, with the exception of a few co-assaying technologies, it is not possible to simultaneously apply different sequencing assays on the same single cell. In this scenario, computational integration of multi-omic measurements is crucial to enable joint analyses. This integration task is particularly challenging due to the lack of sample-wise or feature-wise correspondences. We present single-cell alignment with optimal transport (SCOT), an unsupervised algorithm that uses the Gromov-Wasserstein optimal transport to align single-cell multi-omics data sets. SCOT performs on par with the current state-of-the-art unsupervised alignment methods, is faster, and requires tuning of fewer hyperparameters. More importantly, SCOT uses a self-tuning heuristic to guide hyperparameter selection based on the Gromov-Wasserstein distance. Thus, in the fully unsupervised setting, SCOT aligns single-cell data sets better than the existing methods without requiring any orthogonal correspondence information.


Subjects
Algorithms , Genomics/statistics & numerical data , Sequence Alignment/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Computational Biology , Computer Simulation , Databases, Genetic/statistics & numerical data , Humans , Models, Statistical , Unsupervised Machine Learning
7.
J Comput Biol ; 29(2): 169-187, 2022 02.
Article in English | MEDLINE | ID: mdl-35041495

ABSTRACT

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Kuhnle et al. then showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in linear time and space with respect to the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared with other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references.
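
To make the notion of a maximal exact match concrete, the following brute-force Python sketch lists every pair of positions at which a read and a reference share a match that cannot be extended to the left or to the right. It uses the pairwise definition of a MEM and runs in time proportional to the product of the two lengths, so it is purely illustrative and unrelated to the r-index, thresholds, or the MONI implementation. The function name and the `min_length` parameter are assumptions.

```python
def maximal_exact_matches(read, reference, min_length=1):
    """Return (read_start, reference_start, length) for every MEM.

    Brute-force illustration of the definition only.
    """
    mems = []
    for i in range(len(read)):
        for j in range(len(reference)):
            if read[i] != reference[j]:
                continue
            # Skip starts that could be extended one character to the left.
            if i > 0 and j > 0 and read[i - 1] == reference[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (
                i + length < len(read)
                and j + length < len(reference)
                and read[i + length] == reference[j + length]
            ):
                length += 1
            if length >= min_length:
                mems.append((i, j, length))
    return mems


if __name__ == "__main__":
    print(maximal_exact_matches("ACGT", "TACGA", min_length=2))
```

Index-based tools exist precisely because this quadratic scan is infeasible for collections of genomes.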


Subjects
Algorithms , Genomics/statistics & numerical data , Sequence Alignment/statistics & numerical data , Software , Computational Biology , Databases, Genetic/statistics & numerical data , Genome, Bacterial , Genome, Human , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Salmonella/genetics , Sequence Analysis, DNA/statistics & numerical data , Wavelet Analysis
8.
J Comput Biol ; 29(2): 188-194, 2022 02.
Article in English | MEDLINE | ID: mdl-35041518

ABSTRACT

Efficiently finding maximal exact matches (MEMs) between a sequence read and a database of genomes is a key first step in read alignment. But until recently, it was unknown how to build a data structure in [Formula: see text] space that supports efficient MEM finding, where r is the number of runs in the Burrows-Wheeler Transform. In 2021, Rossi et al. showed how to build a small auxiliary data structure called thresholds in addition to the r-index in [Formula: see text] space. This addition enables efficient MEM finding using the r-index. In this article, we present the tool that implements this solution, which we call MONI. Namely, we give a high-level view of the main components of the data structure and show how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.


Subjects
Algorithms , Sequence Alignment/statistics & numerical data , Software , Computational Biology , Databases, Genetic/statistics & numerical data , Genome, Human , Genomics/statistics & numerical data , Humans , Sequence Analysis, DNA/statistics & numerical data
9.
PLoS Comput Biol ; 17(12): e1009632, 2021 12.
Article in English | MEDLINE | ID: mdl-34905538

ABSTRACT

SHAPE-JuMP is a concise strategy for identifying close-in-space interactions in RNA molecules. Nucleotides in close three-dimensional proximity are crosslinked with a bi-reactive reagent that covalently links the 2'-hydroxyl groups of the ribose moieties. The identities of crosslinked nucleotides are determined using an engineered reverse transcriptase that jumps across crosslinked sites, resulting in a deletion in the cDNA that is detected using massively parallel sequencing. Here we introduce ShapeJumper, a bioinformatics pipeline to process SHAPE-JuMP sequencing data and to accurately identify through-space interactions, as observed in complex JuMP datasets. ShapeJumper identifies proximal interactions with near-nucleotide resolution using an alignment strategy that is optimized to tolerate the unique non-templated reverse-transcription profile of the engineered crosslink-traversing reverse-transcriptase. JuMP-inspired strategies are now poised to replace adapter-ligation for detecting RNA-RNA interactions in most crosslinking experiments.


Subjects
DNA, Complementary/chemistry , RNA/chemistry , Software , Algorithms , Binding Sites , Computational Biology , Cross-Linking Reagents , DNA, Complementary/genetics , Genetic Engineering , Models, Molecular , Nucleic Acid Conformation , RNA/genetics , Sequence Alignment/statistics & numerical data
10.
Comput Math Methods Med ; 2021: 5548993, 2021.
Article in English | MEDLINE | ID: mdl-34777564

ABSTRACT

The development of high-throughput technology has provided a reliable technical guarantee for an increased amount of available data on biological networks. Network alignment is used to analyze these data to identify conserved functional network modules and understand evolutionary relationships across species. Thus, an efficient computational network aligner is needed. In this paper, the classic bat algorithm is discretized and applied to network alignment. The bat algorithm initializes the population randomly and then searches for the optimal solution iteratively. Based on the bat algorithm, the global pairwise alignment algorithm BatAlign is proposed. In BatAlign, the individual velocity and position are represented by a discrete code. BatAlign uses a search algorithm whose objective function is the number of conserved edges. The similarity between the networks is used to initialize the population. The experimental results showed that the algorithm was able to match proteins with high functional consistency and reach a relatively high topological quality.
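
The objective function named in this abstract, the number of conserved edges, can be written down in a few lines. The sketch below assumes undirected networks given as edge lists and an alignment given as a dictionary from nodes of the first network to nodes of the second. It shows only how a candidate alignment would be scored, not the discretized bat search itself, and the names are hypothetical.

```python
def conserved_edges(edges_a, edges_b, mapping):
    """Count edges of network A whose mapped endpoints are adjacent in B.

    Both networks are treated as undirected (an assumption of this sketch).
    """
    b_edges = {frozenset(edge) for edge in edges_b}
    count = 0
    for u, v in edges_a:
        if u in mapping and v in mapping:
            if frozenset((mapping[u], mapping[v])) in b_edges:
                count += 1
    return count


if __name__ == "__main__":
    edges_a = [("a1", "a2"), ("a2", "a3")]
    edges_b = [("b2", "b1"), ("b1", "b3")]
    mapping = {"a1": "b1", "a2": "b2", "a3": "b3"}
    print(conserved_edges(edges_a, edges_b, mapping))
```

A search procedure would evaluate this score for each candidate mapping and keep the mapping that conserves the most edges.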


Subjects
Algorithms , Sequence Alignment/statistics & numerical data , Animals , Chiroptera/physiology , Computational Biology , Computer Simulation , Databases, Genetic , Echolocation/physiology , Gene Ontology , Gene Regulatory Networks , Humans , Metabolic Networks and Pathways , Protein Interaction Maps , Synthetic Biology
11.
PLoS Comput Biol ; 17(8): e1008904, 2021 08.
Article in English | MEDLINE | ID: mdl-34339413

ABSTRACT

The killer-cell immunoglobulin-like receptor (KIR) complex on chromosome 19 encodes receptors that modulate the activity of natural killer cells, and variation in these genes has been linked to infectious and autoimmune disease, as well as having bearing on pregnancy and transplant outcomes. The medical relevance and high variability of KIR genes make short-read sequencing an attractive technology for interrogating the region, providing a high-throughput, high-fidelity sequencing method that is cost-effective. However, because this gene complex is characterized by extensive nucleotide polymorphism, structural variation including gene fusions and deletions, and a high level of homology between genes, its interrogation at high resolution has been thwarted by bioinformatic challenges, with most studies limited to examining presence or absence of specific genes. Here, we present the PING (Pushing Immunogenetics to the Next Generation) pipeline, which incorporates empirical data, novel alignment strategies and a custom alignment processing workflow to enable high-throughput KIR sequence analysis from short-read data. PING provides KIR gene copy number classification functionality for all KIR genes through use of a comprehensive alignment reference. The gene copy number determined per individual enables an innovative genotype determination workflow using genotype-matched references. Together, these methods address the challenges imposed by the structural complexity and overall homology of the KIR complex. To determine copy number and genotype determination accuracy, we applied PING to European and African validation cohorts and a synthetic dataset. PING demonstrated exceptional copy number determination performance across all datasets and robust genotype determination performance. Finally, an investigation into discordant genotypes for the synthetic dataset provides insight into misaligned reads, advancing our understanding of how to interpret short-read sequencing data in complex genomic regions. PING promises to support a new era of studies of KIR polymorphism, delivering high-resolution KIR genotypes that are highly accurate, enabling high-quality, high-throughput KIR genotyping for disease and population studies.


Subjects
Immunogenetics/statistics & numerical data , Receptors, KIR/genetics , Africa, Southern , Alleles , Computational Biology , Computer Simulation , Databases, Nucleic Acid/statistics & numerical data , Europe , Gene Dosage , Genetics, Population/statistics & numerical data , Genotype , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Polymorphism, Genetic , Receptors, KIR/classification , Sequence Alignment/statistics & numerical data , Software Design
12.
PLoS Comput Biol ; 17(6): e1009078, 2021 06.
Article in English | MEDLINE | ID: mdl-34153026

ABSTRACT

It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1 kb and when discovering variation from ONT reads, with runtimes that are comparable (1.05-3.76×) to current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in bioconda (https://anaconda.org/bioconda/lra) and GitHub (https://github.com/ChaissonLab/LRA).
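
The practical difference between linear and concave gap costs can be seen with a toy comparison. In the Python sketch below, the logarithmic form of the concave penalty and the constants `per_base` and `scale` are assumptions chosen for illustration. The abstract does not state which concave function lra uses.

```python
import math


def linear_gap_cost(length, per_base=1.0):
    """Gap cost that grows in proportion to gap length."""
    return per_base * length


def concave_gap_cost(length, scale=1.0):
    """A concave gap cost (logarithmic here, purely as an example)."""
    return scale * math.log1p(length)


if __name__ == "__main__":
    for name, cost in (("linear", linear_gap_cost), ("concave", concave_gap_cost)):
        one_long = cost(1000)
        two_short = cost(500) + cost(500)
        print(f"{name}: one 1000-base gap = {one_long:.2f}, two 500-base gaps = {two_short:.2f}")
```

Under the linear cost the two scenarios are indistinguishable, whereas the concave cost prefers a single long gap, which is the behaviour one wants when a read spans one large structural variant rather than many scattered indels.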


Subjects
Contig Mapping/statistics & numerical data , Sequence Alignment/statistics & numerical data , Software , Cluster Analysis , Computational Biology , Computer Simulation , Databases, Nucleic Acid/statistics & numerical data , Genetic Variation , Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Programming, Linear , Sequence Analysis, DNA
13.
Sci Rep ; 11(1): 7574, 2021 04 07.
Article in English | MEDLINE | ID: mdl-33828153

ABSTRACT

Protein 3D structure prediction has advanced significantly in recent years due to improving contact prediction accuracy. This improvement has been largely due to deep learning approaches that predict inter-residue contacts and, more recently, distances using multiple sequence alignments (MSAs). In this work we present AttentiveDist, a novel approach that uses different MSAs generated with different E-values in a single model to increase the co-evolutionary information provided to the model. To determine the importance of each MSA's feature at the inter-residue level, we added an attention layer to the deep neural network. We show that combining four MSAs of different E-value cutoffs improved the model prediction performance as compared to single E-value MSA features. A further improvement was observed when an attention layer was used, and even more when additional prediction tasks of bond angle predictions were added. The improvements in distance prediction were successfully transferred to achieve better protein tertiary structure modeling.
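
The idea of weighting features from several MSAs with attention can be illustrated with a generic softmax-weighted sum. The sketch below is not the AttentiveDist architecture: how the network produces the attention scores is not described in the abstract, so the scores are simply passed in, and the function names and the toy feature vectors are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, scores):
    """Combine per-MSA feature vectors for one residue pair.

    features[m] is the feature vector derived from MSA m; scores[m] is its
    (hypothetical) unnormalized attention score.
    """
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


if __name__ == "__main__":
    # Four toy feature vectors, one per E-value cutoff.
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    print(attend(features, [0.0, 0.0, 0.0, 0.0]))
    print(attend(features, [5.0, 0.0, 0.0, 0.0]))
```

With equal scores the result is the plain average of the four vectors, and a dominant score pulls the combination toward the corresponding MSA.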


Subjects
Deep Learning , Proteins/chemistry , Sequence Alignment/methods , Caspases/chemistry , Caspases/genetics , Models, Molecular , Neural Networks, Computer , Protein Interaction Domains and Motifs , Protein Structure, Tertiary , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein
14.
Sci Rep ; 11(1): 3791, 2021 02 15.
Article in English | MEDLINE | ID: mdl-33589693

ABSTRACT

The increasing amount of available genomic data has allowed the development of phylogenomic analytical tools. Current methods compile information from single-gene phylogenies, whether based on topologies or multiple sequence alignments. Generally, phylogenomic analyses select gene families or genomic regions to construct phylogenomic trees. Here, we present an alternative approach for phylogenomics, named TOMM (Total Ortholog Median Matrix), to construct a representative phylogram composed of amino acid distance measures of all pairwise ortholog protein sequence pairs from the desired species inside a group of organisms. The procedure is divided into two main steps: (1) ortholog detection and (2) creation of a matrix with the median amino acid distance measures of all pairwise orthologous sequences. We tested this approach within three different groups of organisms: Kinetoplastida protozoa, hematophagous Diptera vectors, and Primates. Our approach was robust and effective in reconstructing the phylogenetic relationships for the three groups. Moreover, novel branch topologies could be achieved, providing insights into phylogenetic relationships between some taxa.
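
Step (2) of the procedure, building a matrix of median distances, can be sketched as below. The input format is an assumption: a dictionary that maps each species pair to the list of amino acid distances of their orthologous protein pairs, with ortholog detection and distance calculation taken as already done. The function name and the toy values are hypothetical.

```python
from statistics import median


def median_distance_matrix(species, ortholog_distances):
    """Build a symmetric species-by-species matrix of median distances.

    ortholog_distances maps (species_a, species_b) to the distances of all
    orthologous protein pairs found between those two species.
    """
    index = {name: i for i, name in enumerate(species)}
    n = len(species)
    matrix = [[0.0] * n for _ in range(n)]
    for (a, b), distances in ortholog_distances.items():
        value = median(distances)
        matrix[index[a]][index[b]] = value
        matrix[index[b]][index[a]] = value
    return matrix


if __name__ == "__main__":
    toy = {
        ("A", "B"): [0.1, 0.3, 0.2],
        ("A", "C"): [0.5, 0.7],
        ("B", "C"): [0.4],
    }
    for row in median_distance_matrix(["A", "B", "C"], toy):
        print(row)
```

The resulting matrix could then be handed to any distance-based tree-building method. The median makes each entry robust to a few fast-evolving or misassigned orthologs.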


Subjects
Evolution, Molecular , Genomics/statistics & numerical data , Open Reading Frames/genetics , Phylogeny , Genome/genetics , Sequence Alignment/statistics & numerical data
15.
PLoS Comput Biol ; 16(11): e1008383, 2020 11.
Article in English | MEDLINE | ID: mdl-33166275

ABSTRACT

In large DNA sequence repositories, archival data storage is often coupled with computers that provide 40 or more CPU threads and multiple GPU (general-purpose graphics processing unit) devices. This presents an opportunity for DNA sequence alignment software to exploit high-concurrency hardware to generate short-read alignments at high speed. Arioc, a GPU-accelerated short-read aligner, can compute WGS (whole-genome sequencing) alignments ten times faster than comparable CPU-only alignment software. When two or more GPUs are available, Arioc's speed increases proportionately because the software executes concurrently on each available GPU device. We have adapted Arioc to recent multi-GPU hardware architectures that support high-bandwidth peer-to-peer memory accesses among multiple GPUs. By modifying Arioc's implementation to exploit this GPU memory architecture we obtained a further 1.8x-2.9x increase in overall alignment speeds. With this additional acceleration, Arioc computes two million short-read alignments per second in a four-GPU system; it can align the reads from a human WGS sequencer run (over 500 million 150nt paired-end reads) in less than 15 minutes. As WGS data accumulates exponentially and high-concurrency computational resources become widespread, Arioc addresses a growing need for timely computation in the short-read data analysis toolchain.


Subjects
Sequence Alignment/methods , Software , Algorithms , Base Sequence , Computational Biology , Computer Graphics , Computers , Databases, Nucleic Acid , Humans , Information Storage and Retrieval , Sequence Alignment/statistics & numerical data , Sequence Analysis, DNA , Whole Genome Sequencing
16.
PLoS One ; 15(8): e0233673, 2020.
Article in English | MEDLINE | ID: mdl-32750050

ABSTRACT

Computational algorithms are often used to assess pathogenicity of Variants of Uncertain Significance (VUS) that are found in disease-associated genes. Most computational methods include analysis of protein multiple sequence alignments (PMSA), assessing interspecies variation. Careful validation of PMSA-based methods has been done for relatively few genes, partially because creation of curated PMSAs is labor-intensive. We assessed how PMSA-based computational tools predict the effects of the missense changes in the APC gene, in which pathogenic variants cause Familial Adenomatous Polyposis. Most Pathogenic or Likely Pathogenic APC variants are protein-truncating changes. However, public databases now contain thousands of variants reported as missense. We created a curated APC PMSA that contained >3 substitutions/site, which is large enough for statistically robust in silico analysis. The creation of the PMSA was not easily automated, requiring significant querying and computational analysis of protein and genome sequences. Of 1924 missense APC variants in the NCBI ClinVar database, 1800 (93.5%) are reported as VUS. All but two missense variants listed as P/LP occur at canonical splice or Exonic Splice Enhancer sites. Pathogenicity predictions by five computational tools (Align-GVGD, SIFT, PolyPhen2, MAPP, REVEL) differed widely in their predictions of Pathogenic/Likely Pathogenic (range 17.5-75.0%) and Benign/Likely Benign (range 25.0-82.5%) for APC missense variants in ClinVar. When applied to 21 missense variants reported in ClinVar and securely classified as Benign, the five methods ranged in accuracy from 76.2-100%. Computational PMSA-based methods can be an excellent classifier for variants of some hereditary cancer genes. However, there may be characteristics of the APC gene and protein that confound the results of in silico algorithms. 
A systematic study of these features could greatly improve the automation of alignment-based techniques and the use of predictive algorithms in hereditary cancer genes.


Subjects
Adenomatous Polyposis Coli Protein/genetics , Adenomatous Polyposis Coli/genetics , Genes, APC , Mutation, Missense , Algorithms , Amino Acid Sequence , Computational Biology/methods , Computer Simulation , Databases, Protein , Enhancer Elements, Genetic , Evolution, Molecular , Exons , Genetic Variation , Humans , Phylogeny , Protein Isoforms/genetics , RNA Splice Sites , Sequence Alignment/statistics & numerical data
17.
Bull Math Biol ; 82(2): 21, 2020 01 22.
Article in English | MEDLINE | ID: mdl-31970502

ABSTRACT

In evolutionary biology, the speciation history of living organisms is represented graphically by a phylogeny, that is, a rooted tree whose leaves correspond to current species and whose branchings indicate past speciation events. Phylogenetic analyses often rely on molecular sequences, such as DNA sequences, collected from the species of interest, and it is common in this context to employ statistical approaches based on stochastic models of sequence evolution on a tree. For tractability, such models necessarily make simplifying assumptions about the evolutionary mechanisms involved. In particular, commonly omitted are insertions and deletions of nucleotides, also known as indels. Properly accounting for indels in statistical phylogenetic analyses remains a major challenge in computational evolutionary biology. Here, we consider the problem of reconstructing ancestral sequences on a known phylogeny in a model of sequence evolution incorporating nucleotide substitutions, insertions and deletions, specifically the classical TKF91 process. We focus on the case of dense phylogenies of bounded height, which we refer to as the taxon-rich setting, where statistical consistency is achievable. We give the first explicit reconstruction algorithm with provable guarantees under constant rates of mutation. Our algorithm succeeds when the phylogeny satisfies the "big bang" condition, a necessary and sufficient condition for statistical consistency in this setting.


Subjects
DNA/genetics , Genetic Models , Algorithms , Base Sequence , Computational Biology , Computer Simulation , Molecular Evolution , INDEL Mutation , Likelihood Functions , Markov Chains , Mathematical Concepts , Statistical Models , Phylogeny , Sequence Alignment/statistics & numerical data
18.
J Comput Biol ; 27(9): 1361-1372, 2020 09.
Article in English | MEDLINE | ID: mdl-31913652

ABSTRACT

Sequence alignment is a fundamental technique in bioinformatics for identifying regions of similarity among sequences. The degree of similarity is summarized as a score. Several methods exist for assessing the statistical significance of this score in the gapped and ungapped cases. In this article, we improve the accuracy of the statistical significance of the local score by introducing a new approximate p-value. It is derived from Poisson clumping and the exact distribution of a partial sum of random variables. The efficiency of the proposed method is compared with that of previous methods on real and simulated data. The results show a remarkable improvement in the accuracy of the p-value in the gapped case. This is evidence that the method is a promising candidate for sequence comparison.
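
As a rough illustration of what a local score and its p-value are, the sketch below computes the local score of a sequence of per-position scores (the maximum sum over contiguous segments) and estimates its p-value by plain Monte Carlo simulation. This is not the approximation proposed in the article, which relies on Poisson clumping; the function names, the i.i.d. score model, and the example score distribution are all assumptions made for illustration.

```python
import random


def local_score(scores):
    """Return the maximum sum over all contiguous segments (0 if all sums are negative)."""
    best = running = 0
    for s in scores:
        # Restart the segment whenever the running sum would drop below zero.
        running = max(0, running + s)
        best = max(best, running)
    return best


def monte_carlo_p_value(observed, length, values, weights, trials=2000, seed=0):
    """Estimate P(local score >= observed) for i.i.d. scores (assumed null model).

    `values` and `weights` describe a hypothetical per-position score
    distribution; they are not taken from the article.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = rng.choices(values, weights=weights, k=length)
        if local_score(sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


if __name__ == "__main__":
    observed = local_score([1, -2, 3, 1, -1, 2])
    p = monte_carlo_p_value(observed, length=6, values=[-2, -1, 1, 3],
                            weights=[0.4, 0.3, 0.2, 0.1])
    print(observed, p)
```

Simulation of this kind becomes expensive for very small p-values, which is one reason analytical approximations such as the one described in the abstract are of interest.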


Subjects
Amino Acid Sequence/genetics , Computational Biology/statistics & numerical data , Statistical Models , Sequence Alignment/statistics & numerical data , Algorithms , Probability
19.
Immunogenetics ; 72(1-2): 49-55, 2020 02.
Article in English | MEDLINE | ID: mdl-31641782

ABSTRACT

The Immuno Polymorphism Database (IPD), https://www.ebi.ac.uk/ipd/, is a set of specialist databases that enable the study of polymorphic genes which function as part of the vertebrate immune system. The major focus is on the hyperpolymorphic major histocompatibility complex (MHC) genes and the killer-cell immunoglobulin-like receptor (KIR) genes, by providing the official repository and primary source of sequence data. Databases are centred around humans as well as animals important for food security, for companionship and as disease models. The IPD project works with specialist groups or nomenclature committees who provide and manually curate individual sections before they are submitted for online publication. To reflect the recent advance of allele sequencing technologies and the increasing demands of novel tools for the analysis of genomic variation, the IPD project is undergoing a progressive redesign and reorganisation. In this review, recent updates and future developments are discussed, with a focus on the core concepts to better future-proof the project.


Subjects
Human Platelet Antigens/genetics , Major Histocompatibility Complex/genetics , Computational Biology/methods , Databases as Topic , Factual Databases , Genetic Databases , T-Lymphocyte Epitopes/genetics , HLA Antigens/genetics , Humans , Immunity/genetics , Genetic Polymorphism/genetics , Sequence Alignment/statistics & numerical data
20.
PLoS Comput Biol ; 15(12): e1007219, 2019 12.
Article in English | MEDLINE | ID: mdl-31846452

ABSTRACT

The most frequently used approach for protein structure prediction is currently homology modeling. The 3D model building phase of this methodology is critical for obtaining an accurate and biologically useful prediction. The most widely employed tool to perform this task is MODELLER. This program implements the "modeling by satisfaction of spatial restraints" strategy and its core algorithm has not been altered significantly since the early 1990s. In this work, we have explored the idea of modifying MODELLER with two effective, yet computationally light strategies to improve its 3D modeling performance. Firstly, we have investigated how the accuracy with which structural variability between a target protein and its templates is estimated, in the form of σ values, profoundly influences 3D modeling. We show that the σ values produced by MODELLER are on average weakly correlated with the true level of structural divergence between target-template pairs and that increasing this correlation greatly improves the program's predictions, especially in multiple-template modeling. Secondly, we have examined how incorporating statistical potential terms (such as the DOPE potential) in MODELLER's objective function positively impacts 3D modeling quality, providing a small but consistent improvement in metrics such as GDT-HA and lDDT and a large increase in stereochemical quality. Python modules to harness this second strategy are freely available at https://github.com/pymodproject/altmod. In summary, we show that there is considerable room for improving MODELLER in terms of 3D modeling quality and we propose strategies that could be pursued in order to further increase its performance.


Subjects
Molecular Models , Software , Protein Structural Homology , Algorithms , Computational Biology , Molecular Dynamics Simulation/statistics & numerical data , Proteins/chemistry , Sequence Alignment/statistics & numerical data
...